Abstract
Background While AI technologies for synthetic data (SD) generation are well developed, their direct application in clinical settings remains challenging. Key issues include privacy and security concerns, which arise when sensitive patient data must be processed by closed, third-party models. Furthermore, many generative models lack adequate clinical validation and often fail to capture and replicate the complex correlations between clinical variables. These barriers are particularly acute in rare diseases like β-thalassemia, where data are already scarce. Our work addresses these specific challenges by implementing a secure, transparent platform to generate a high-fidelity synthetic cohort of patients with transfusion-dependent β-thalassemia (TDT).
Aims This work aims: 1) to implement an AI-based SD generation platform within the locally secured, privacy-preserving environment of the Webthal® dataset; 2) to demonstrate its secure integration and efficacy in generating high-quality synthetic clinical data, suitable for clinical research, creation of digital twins and synthetic control arms for clinical trials.
Methods We implemented the TRAIN SD generation platform (www.train-ai.eu) within the Webthal® environment. This platform integrates various generative models designed for multimodal data. A selected CT-WGAN was trained on a retrospective cohort of 779 adult (≥18 years) TDT patients. Real-world data were collected from Italian centers using the Webthal® computerized medical record from 2010 to 2019. To assess the quality of the SD, SAFE (Synthetic vAlidation FramEwork) was implemented within the platform to evaluate statistical fidelity, clinical utility, and privacy preservation. SAFE computes several statistical metrics, aggregated into the Clinical Synthetic Fidelity (CSF) score for statistical fidelity, and the nearest neighbor distance ratio (NNDR) for privacy preservation. For clinical validation, we used the synthetic TDT cohort to replicate the findings of Musallam et al. (PMID:37976447) on the association between pre-transfusion hemoglobin (Hb) levels and mortality, comparing variable distributions and clinical conclusions from the SD analysis with those derived from the real cohort. All analyses were conducted in three settings: 1) creating a 1:1 privacy-preserving proxy of the original Webthal® dataset; 2) augmenting the cohort to twice its size for simulation purposes; and 3) conditionally generating a dataset with specific patient characteristics (deaths by Hb category) to show the platform's flexibility for clinical research.
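As an illustration of the privacy metric named above, the following is a minimal sketch of one common way to compute an NNDR, not the SAFE implementation. It assumes purely numeric, already-scaled features, Euclidean distance, and a simple mean as the aggregate; the function name `nndr` and the toy records are hypothetical. For each synthetic record, the distance to its nearest real record is divided by the distance to its second-nearest real record, so values near 1 suggest the record is not unusually close to any single real patient, while values near 0 flag near-copies.

```python
import math


def nndr(synthetic, real):
    """Mean nearest-neighbor distance ratio of synthetic rows against real rows.

    Assumption: rows are equal-length sequences of scaled numeric features.
    """
    if len(real) < 2:
        raise ValueError("need at least two real records")
    ratios = []
    for s in synthetic:
        # Distances from this synthetic record to every real record, ascending.
        dists = sorted(math.dist(s, r) for r in real)
        d1, d2 = dists[0], dists[1]
        # If even the second-nearest distance is zero, the record duplicates
        # real data; treat it as the worst case (ratio 0).
        ratios.append(0.0 if d2 == 0 else d1 / d2)
    return sum(ratios) / len(ratios)


# Toy data (hypothetical, not from the Webthal cohort).
real_rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
copy_score = nndr([(0.0, 0.0)], real_rows)      # exact copy of a real row
midpoint_score = nndr([(0.5, 0.0)], real_rows)  # equidistant from two real rows
```

In this toy case the exact copy scores 0 and the equidistant point scores 1, which is the intuition behind reading a higher NNDR as better privacy preservation.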
Results Patient distributions, stratified by pre-transfusion Hb levels, were consistent with the original cohort. The 5- and 10-year survival rates, as well as the unadjusted and adjusted hazard ratios for mortality, were comparable between the real and synthetic data. No identical matches to a real patient were found in any of the generated datasets. The replicated dataset (n=779) demonstrated high fidelity (CSF=0.91, NNDR=0.84). The log-rank test results for Kaplan-Meier survival stratified by Hb category were highly comparable between the original (chisq=11.6, p=0.02) and synthetic datasets (chisq=12.9, p=0.01). The augmented dataset (n=1558) maintained high fidelity (CSF=0.90, NNDR=0.82). The increased statistical power was evident in the log-rank test, which yielded a more strongly significant result (chisq=46.1, p<0.001) than the original (p=0.02), demonstrating the utility of augmentation for strengthening statistical signals.
The conditionally generated cohort (n=779) showed excellent fidelity (CSF=0.90, NNDR=0.81) and remarkable clinical mimicry. It replicated the statistical outcome of the log-rank test on the original data (synthetic: chisq=11.2, p=0.02). When stratifying by ferritin levels, the association between mortality and Hb that was found in the ≤1000 ng/mL category of the original data was instead found in the >1000 ng/mL category in the replicated datasets, and in both categories in the larger CD cohort.
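The reported log-rank statistics can be related to their p-values with a short calculation. The sketch below assumes 4 degrees of freedom (that is, five Hb categories), which the abstract does not state; the helper `chi2_sf_even_df` is a hypothetical name and uses the closed-form chi-square upper tail that holds for even degrees of freedom. Under that assumption, the computed p-values agree with the reported ones to two decimals.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2k: exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here requires a positive even df")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


# Assumption: df = 4, i.e. five Hb categories (not stated in the abstract).
DF = 4
p_original = chi2_sf_even_df(11.6, DF)     # original cohort
p_replicated = chi2_sf_even_df(12.9, DF)   # 1:1 replicated dataset
p_augmented = chi2_sf_even_df(46.1, DF)    # augmented dataset
p_conditional = chi2_sf_even_df(11.2, DF)  # conditionally generated cohort
```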
Conclusions The study demonstrates that a robust, clinically oriented platform can generate high-quality, useful SD even in the complex setting of a rare disease such as β-thalassemia, overcoming limitations in local installation and clinical validation. SD replicated key clinical outcomes and statistical properties, enabled effective data augmentation, and allowed for flexible cohort simulation. This validated technology therefore represents a powerful tool to overcome data sharing barriers and accelerate precision medicine research in hematology.